Hierarchical Clustering#

Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters (not just a partition of the sample into disjoint sets, but a system of nested partitions).

There are two strategies for hierarchical clustering:

  • Divisive or top-down algorithms split the sample into smaller and smaller clusters

  • Agglomerative or bottom-up algorithms, in which objects are combined into larger and larger clusters

[Figure: agglomerative (bottom-up) and divisive (top-down) clustering]

The agglomerative approach is more popular and effective. The algorithm itself is quite simple and consists of the following steps:

  • Initialization: Create as many clusters as there are objects, each object in its own separate cluster.

  • Repeat: Perform the following steps until the stopping criterion is met:

    • Compute distances: Calculate the pairwise distance or similarity between each pair of clusters.

    • Merging: Combine the two closest clusters into a new cluster.
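The loop above can be sketched in a few lines of Python. This is a naive, quadratic-per-step illustration rather than an efficient implementation; the toy 1-D data, the complete-linkage distance, and the stopping criterion (a target number of clusters) are illustrative choices:

```python
import itertools

def agglomerate(points, target_clusters=2):
    # Initialization: each object starts in its own cluster.
    clusters = [[p] for p in points]
    # Repeat until the stopping criterion (desired cluster count) is met.
    while len(clusters) > target_clusters:
        # Compute pairwise cluster distances (complete linkage here).
        best_pair, best_dist = None, float("inf")
        for i, j in itertools.combinations(range(len(clusters)), 2):
            d = max(abs(a - b) for a in clusters[i] for b in clusters[j])
            if d < best_dist:
                best_pair, best_dist = (i, j), d
        # Merging: combine the two closest clusters into one.
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(agglomerate([1.0, 1.2, 5.0, 5.1], target_clusters=2))
# -> [[1.0, 1.2], [5.0, 5.1]]
```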

Distance Matrix and its calculation#

The main data structure used by hierarchical clustering algorithms, which allows them to keep track of distances between clusters, is the distance matrix.

Formally, if the dataset has n data points \(\{x_1, \ldots, x_n\}\), a distance matrix \(D\) is an \(n \times n\) matrix, where \(D_{i j}\) is the distance between points \(x_{i}\) and \(x_{j}\). By definition, the distance matrix is symmetric and its diagonal entries are all zeros.

Usually, the distance between two data points \(x, y \in\mathbb R^d\) is calculated with metrics such as:

  • Euclidean distance: \(\hspace{2mm} \rho(x, y)=\sqrt{\sum\limits_{i=1}^d(x_i - y_i)^2}\)

  • Squared Euclidean distance: \(\hspace{2mm} \rho(x, y)=\sum\limits_{i=1}^d(x_i - y_i)^2\)

  • Manhattan Distance: \(\hspace{2mm} \rho(x, y)=\sum\limits_{i=1}^d |x_i - y_i|\)

  • Cosine Distance: \(\hspace{2mm} \rho(x, y)=1-\frac{\sum\limits_{i=1}^dx_i y_i}{\sqrt{\sum\limits_{i=1}^dx_i^2}\sqrt{\sum\limits_{i=1}^dy_i^2}}\) (one minus the cosine similarity)
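These metrics can be tried out with SciPy: `pdist` computes the condensed pairwise distances, and `squareform` expands them into the symmetric \(n \times n\) distance matrix described above (the toy data here is illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three toy points in the plane; [3, 4] and [6, 8] are parallel vectors,
# so their cosine distance is 0 while their Euclidean distance is 5.
X = np.array([[1.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

for metric in ("euclidean", "sqeuclidean", "cityblock", "cosine"):
    D = squareform(pdist(X, metric=metric))  # symmetric, zero diagonal
    print(metric)
    print(np.round(D, 2))
```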

Linkage and Distance#

At first, each object is treated as its own cluster. For single-element clusters, the distance function is naturally defined:

\(R(\{x\},\{y\})=\rho (x, y)\)

The distance \(\rho\) between objects can be specified by any metric: Euclidean, Manhattan, or, for example, the cosine measure. Then the merging process starts. At each iteration, the pair of closest clusters \(U\) and \(V\) is replaced by a new cluster \(W = U \cup V\). There are several methods of identifying the closest clusters:

  • Single linkage aka Nearest Point Algorithm: The distance between clusters is defined by the distance between their closest members.

[Figure: single linkage]
\[\hspace{2mm} R_{min}(U, V)=\min\limits_{u \in U, v \in V}\rho (u, v)\]
  • Complete linkage aka Farthest Point Algorithm: The distance between clusters is defined by the distance between their furthest members.

[Figure: complete linkage]
\[\hspace{2mm} R_{max}(U, V)=\max\limits_{u \in U, v \in V}\rho (u, v)\]
  • Unweighted Pair Group Method with Arithmetic mean (UPGMA): Also known as average linkage; the distance between two clusters is the average of the distances over all pairs of their members.

[Figure: average linkage (UPGMA)]
\[\hspace{2mm} R_{avg}(U, V)=\dfrac{1}{|U|\cdot|V|}\sum\limits_{u \in U}\sum\limits_{v \in V}\rho (u, v)\]
  • Unweighted Pair Group Method with Centroid average (UPGMC): A point defined by the mean of all points (centroid) is calculated for each cluster and the distance between clusters is the distance between their respective centroids.

[Figure: centroid linkage (UPGMC)]
\[\hspace{2mm} R_{cen}(U, V)=\rho^2(\sum\limits_{u \in U}\dfrac{u}{|U|}, \sum\limits_{v \in V}\dfrac{v}{|V|})\]
  • Ward’s method: It is based on the increase in the sum of squared errors (ESS) that results from merging two clusters. It aims to minimize the variance within the newly formed cluster. The idea is to choose the pair of clusters for which the merging leads to the smallest increase in the total within-cluster sum of squares.

[Figure: Ward's method]
\[\hspace{2mm} R_{ward}(U, V)=\dfrac{|U|\cdot|V|}{|U|+|V|}\rho^2(\sum\limits_{u \in U}\dfrac{u}{|U|}, \sum\limits_{v \in V}\dfrac{v}{|V|})\]
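All of these criteria map directly onto the `method` argument of SciPy's `linkage`. A minimal sketch comparing them on random data (the data and the choice of printing the final merge distance are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))  # 10 random 2-D points

for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(X, method=method)
    # Z has n - 1 rows, one per merge; Z[:, 2] holds the merge distances.
    print(method, round(Z[-1, 2], 3))
```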



At each step, it is useful to be able to quickly calculate the distance from the newly formed cluster \(W = U \cup V\) to any other cluster \(S\), using the distances already known from previous steps. This can be done with the recurrence proposed by Lance and Williams in 1967:

\[R(W,S)=\alpha_{U}\cdot R(U,S)+\alpha_{V}\cdot R(V,S)+\beta\cdot R(U,V)+\gamma\cdot|R(U,S)-R(V,S)|\]
where \(\alpha_{U}, \alpha_{V}, \beta, \gamma\) are coefficients that depend on the linkage method.
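For example, single linkage corresponds to \(\alpha_U=\alpha_V=\tfrac{1}{2}\), \(\beta=0\), \(\gamma=-\tfrac{1}{2}\), which reduces the formula to \(\min(R(U,S), R(V,S))\), while complete linkage uses \(\gamma=+\tfrac{1}{2}\) and yields the maximum. A quick numerical check (the distance values are arbitrary toy numbers):

```python
def lance_williams(r_us, r_vs, r_uv, a_u, a_v, beta, gamma):
    # The Lance-Williams recurrence for R(W, S) with W = U ∪ V.
    return a_u * r_us + a_v * r_vs + beta * r_uv + gamma * abs(r_us - r_vs)

r_us, r_vs, r_uv = 2.0, 5.0, 3.0  # toy pairwise distances

# Single linkage: alpha_U = alpha_V = 1/2, beta = 0, gamma = -1/2
single = lance_williams(r_us, r_vs, r_uv, 0.5, 0.5, 0.0, -0.5)

# Complete linkage: alpha_U = alpha_V = 1/2, beta = 0, gamma = +1/2
complete = lance_williams(r_us, r_vs, r_uv, 0.5, 0.5, 0.0, 0.5)

print(single, complete)  # -> 2.0 5.0, i.e. min and max of (r_us, r_vs)
```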

Pros and Cons#

Each method has its pros and cons; the table below summarizes them:

| Method | Pros | Cons |
|--------|------|------|
| Single linkage | Can capture clusters of non-elliptical shapes, such as elongated clusters or clusters with more complex geometries; supports various distance metrics. | Sensitive to noise and outliers; prone to the "chaining" effect, where actually separate clusters are connected by a series of intermediate points that act as bridges. |
| Complete linkage | Less susceptible to noise and outliers; supports various distance metrics. | Can unite rather dissimilar groups at an early stage; struggles to capture clusters of non-elliptical shape. |
| UPGMA | Provides a balance between single and complete linkage; less sensitive to noise and outliers than single linkage. | Biased towards globular clusters. |
| UPGMC | Quite intuitive; less sensitive to noise and outliers than single linkage. | Less suitable for capturing non-elliptical clusters; can suffer from the "inversion phenomenon", where merging two clusters decreases their distance to other clusters; supports only the Euclidean distance metric. |
| Ward's method | Tends to produce compact, equally-sized clusters (similar to k-means); less sensitive to noise and outliers than single linkage. | Assumes that the clusters have approximately the same size; less suitable for capturing non-elliptical clusters; supports only the Euclidean distance metric. |

There is no single best linkage method: each of the distances listed above has its drawbacks and none suits every task. In practice, however, Ward's method has performed best in experimental comparisons on representative sets of model problems, recovering the intuitively correct clustering more often than the other methods.

Dendrogram#

As clusters merge, each iteration of the algorithm records the pair of clusters merged at that step, together with the distance between them at the moment of merging. For monotonic linkages these distances only increase with each iteration, so the result can be visualized as an elegant cluster tree called a dendrogram.

On the diagram above, the synthetic data is shown on the left and the dendrogram built from it on the right. On the dendrogram, sample objects are marked along the horizontal axis and distances between clusters along the vertical axis; cluster merges correspond to horizontal lines.

We can set a linkage distance threshold at or above which clusters will not be merged. Play with the slider to see how the clusters change depending on the threshold value.

Number of clusters

To determine the number of clusters, find the step \(t\) that maximizes the gap \(|R_{t+1}-R_{t}|\) between successive merge distances, where \(R_t\) is the distance at which the merge at step \(t\) occurs. The final clusters are those obtained at step \(t\); for a sample of \(m\) objects this gives \(m-t+1\) clusters. However, when the number of clusters is not known in advance and the sample is not too large, it can be useful to inspect the full dendrogram.
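This gap heuristic is easy to sketch with SciPy: the merge distances are the third column of the linkage matrix, and cutting the tree just below the largest jump recovers the clusters (toy 1-D data; variable names are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data with two well-separated groups.
X = np.array([[0.0], [0.3], [0.5], [10.0], [10.2], [10.4]])
Z = linkage(X, method="ward")

dists = Z[:, 2]            # merge distance R_t at each step
gaps = np.diff(dists)      # |R_{t+1} - R_t|
t = np.argmax(gaps)        # step just before the largest jump

# Cut between the two merges that straddle the largest gap.
threshold = (dists[t] + dists[t + 1]) / 2
labels = fcluster(Z, t=threshold, criterion="distance")
print(labels)  # the two groups receive two distinct labels
```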

Example: Iris Dataset#

About Dataset: The Iris flower dataset is a classic dataset in the field of machine learning and statistical analysis. It consists of 150 observations of iris flowers, including the sepal and petal length and width for each flower, as well as the species of the flower.

About Features:

| Name | Description |
|------|-------------|
| sepal_length | Sepal length, in centimeters, used as input. |
| sepal_width | Sepal width, in centimeters, used as input. |
| petal_length | Petal length, in centimeters, used as input. |
| petal_width | Petal width, in centimeters, used as input. |
| class | Iris Setosa, Versicolor, or Virginica, used as the target. |

import plotly.express as px
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

df_iris = px.data.iris()

dist_sin = linkage(df_iris.loc[:, ["sepal_length", "sepal_width", "petal_length", "petal_width"]], method="single")

plt.figure(figsize=(18,6))
dendrogram(dist_sin, leaf_rotation=90)
plt.xlabel('Index')
plt.ylabel('Distance')
plt.suptitle("Dendrogram, Single Linkage", fontsize=18)
plt.show()
[Figure: dendrogram for the Iris dataset, single linkage]
dist_comp = linkage(df_iris.loc[:, ["sepal_length", "sepal_width", "petal_length", "petal_width"]], method="complete")

plt.figure(figsize=(18,6))
dendrogram(dist_comp, leaf_rotation=90)
plt.xlabel('Index')
plt.ylabel('Distance')
plt.suptitle("Dendrogram, Complete Linkage", fontsize=18)
plt.show()
[Figure: dendrogram for the Iris dataset, complete linkage]

Example: Wine Dataset#

About Dataset: These data are the results of a chemical analysis of wines grown in the same region of Italy but obtained from three different grape varieties. The analysis determined the quantities of 13 constituents found in each of the three types of wine.

About Features:

| Name | Description |
|------|-------------|
| Alcohol | Percentage of alcohol in the wine. Measures the ethanol content and affects the wine's strength. |
| Malic Acid | The amount of malic acid in the wine. Malic acid gives wine freshness and brightness. |
| Ash | The amount of minerals (ash) left after the water has evaporated and the residues have been burned. Reflects the minerality of the wine. |
| Alcalinity of Ash | Alkalinity of the ash in the wine. Related to the pH level of the wine and affects its flavor profile. |
| Magnesium | Amount of magnesium in the wine. Magnesium is a trace element that can affect the taste and aroma of the wine. |
| Total Phenols | The total amount of phenolic compounds in the wine. Phenols are antioxidants and can affect the taste and color of the wine. |
| Flavanoids | The amount of flavonoids in the wine. Flavonoids are also phenolic compounds, contribute to flavor and color, and have antioxidant properties. |
| Nonflavanoid Phenols | Amount of non-flavonoid phenolic compounds in the wine. |
| Proanthocyanins | The amount of proanthocyanidins in the wine, another group of phenolic compounds. |
| Color Intensity | Measured as the absorption of light at a specific wavelength; related to the depth of the wine's color. |
| Hue | The shade of the wine on a color scale, ranging from orange to purple; related to the subtlety of the wine's color. |
| OD280/OD315 of Diluted Wines | The optical density of the wine at specific wavelengths. May be related to the wine's content of anthocyanins (pigments that give wine its red color). |
| Proline | The amount of the amino acid proline in the wine. Proline can affect the texture and structure of the wine. |

#Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, dendrogram
import plotly.figure_factory as ff

#Importing dataset
df_wine = pd.read_csv("assets/wine-clustering.csv")
df_wine.head()
Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity Hue OD280 Proline
0 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735

Let's look at the correlation matrix of the dataset.

The correlation matrix shows how strong the linear relationship between each pair of features is (that is, to what extent one feature can be expressed as a linear function of another).

corr = df_wine.corr()

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt='.2f')
plt.title('Correlation matrix for Wine dataset', fontsize=18, fontweight='bold')
plt.show()
[Figure: correlation matrix heatmap for the Wine dataset]

The highest correlation coefficient is 0.86: as Total_Phenols increases, Flavanoids increases as well.

There is also a negative correlation between some features (that is, a decrease in one feature entails an increase in another and vice versa).

Overall, the correlation between features is mostly in the range of -0.2 to 0.3.

Since there is no strong multicollinearity in this dataset, we will not remove any variables and will work with the full feature set.

Standardization of features#

scaler = StandardScaler()
df_2 = scaler.fit_transform(df_wine)
dist_comp = linkage(df_2, method="ward")

plt.figure(figsize=(18,6))
dendrogram(dist_comp, leaf_rotation=90)
plt.xlabel('Index')
plt.ylabel('Distance')
plt.suptitle("Dendrogram using Ward Method", fontsize=18) 
plt.show()
[Figure: dendrogram for the Wine dataset, Ward's method]

Ward's method seeks to minimize the increase in the total within-cluster variance after merging clusters, and it tends to form clusters with relatively uniform within-cluster dispersion. As can be seen from this dendrogram, it is best to divide this dataset into 3 clusters (that is, define 3 classes). We will fit a model with the following hyperparameters:

n_clusters = 3, linkage = ‘ward’

And let’s make a plot to see our clusters:

import hvplot.pandas
import panel as pn
from sklearn.cluster import AgglomerativeClustering 
hc = AgglomerativeClustering(n_clusters=3, linkage='ward')

# Note: the dendrogram above was built on the standardized data,
# while the clusters here are fit on the raw features.
y_hc = hc.fit_predict(df_wine)
df_wine['cluster'] = y_hc

x = pn.widgets.Select(name='x', value='Proline',
                      options=['Alcohol', 'Malic_Acid','Ash','Ash_Alcanity',
                               'Magnesium','Total_Phenols', 'Flavanoids', 
                               'Nonflavanoid_Phenols', 'Proanthocyanins','Color_Intensity','Hue',
                               'OD280', 'Proline'])
y = pn.widgets.Select(name='y', value='Color_Intensity',
                      options=['Alcohol', 'Malic_Acid','Ash','Ash_Alcanity',
                               'Magnesium','Total_Phenols', 'Flavanoids', 
                               'Nonflavanoid_Phenols', 'Proanthocyanins','Color_Intensity','Hue',
                               'OD280', 'Proline'])

kind = pn.widgets.Select(name='kind', value='scatter', 
                         options=['scatter'])

by_cluster = pn.widgets.Checkbox(name='cluster', value = True)
color = pn.widgets.ColorPicker(value='#0000ff')

@pn.depends(by_cluster, color)
def by_cluster_name(by_cluster, color):
    return 'cluster' if by_cluster else color

plot = df_wine.hvplot(x=x, y=y, kind=kind, c=by_cluster_name, colorbar=True, width=600, title = 'Interactive plot to see clusters variation based on features')

pn.Row(pn.WidgetBox(x, y, kind, color, by_cluster), plot)

Feel free to experiment and see how the clusters look for each pair of features in the dataset.

Performance evaluation indices#

Silhouette Score

Silhouette Score is a widely used metric to evaluate the quality of clustering results. It measures how similar a data point is to its own cluster compared to other clusters. The score ranges from -1 to 1, with a higher value indicating better clustering performance.

Calinski-Harabasz Index

The Calinski-Harabasz index, also known as the Variance Ratio Criterion, is another metric for evaluating clustering performance. It measures the ratio of between-cluster variance to within-cluster variance. A higher Calinski-Harabasz value indicates better clustering performance, with greater separation between clusters and less variance within them.

from sklearn.metrics import silhouette_score
from sklearn.metrics import calinski_harabasz_score

# Score on the feature columns only: 'cluster' is a label, not a feature,
# and should not enter the distance computations.
features = df_wine.drop(columns='cluster')

silhouette = silhouette_score(features, y_hc)
chi = calinski_harabasz_score(features, y_hc)
print('Silhouette Score', round(silhouette,3))
print('Calinski-Harabasz Index', round(chi,3))
Silhouette Score 0.564
Calinski-Harabasz Index 552.856